In your final group assignment you have to analyse data about Airbnb listings and fit a model to predict the total cost for two people staying 4 nights in an AirBnB in a city. You can download AirBnB data from insideairbnb.com; it was originally scraped from airbnb.com.

The following Google sheet shows which cities you can use; please choose one of them and add your group name next to it, e.g., A7, B13. No city can have more than 2 groups per stream working on it; if this happens, I will allocate study groups to cities with the help of R’s sampling.

All of the listings are a GZ file, namely they are archive files compressed by the standard GNU zip (gzip) compression algorithm. You can download, save and extract the file if you wanted, but vroom::vroom() or readr::read_csv() can immediately read and extract this kind of a file. You should prefer vroom() as it is faster, but if vroom() is limited by a firewall, please use read_csv() instead.

vroom will download the *.gz zipped file, unzip, and provide you with the dataframe.

Even though there are many variables in the data frame, here is a quick description of some of the variables collected, and you can find a data dictionary here

  • price = cost per night

  • property_type: type of accommodation (House, Apartment, etc.)

  • room_type:

    • Entire home/apt (guests have entire place to themselves)
    • Private room (Guests have private room to sleep, all other rooms shared)
    • Shared room (Guests sleep in room shared with others)
  • number_of_reviews: Total number of reviews for the listing

  • review_scores_rating: Average review score (0 - 100)

  • longitude , latitude: geographical coordinates to help us locate the listing

  • neighbourhood*: three variables on a few major neighbourhoods in each city

1 Exploratory Data Analysis (EDA)

In the R4DS Exploratory Data Analysis chapter, the authors state:

“Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation… EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions.”

Conduct a thorough EDA. Recall that an EDA involves three things:

  • Looking at the raw values.
    • dplyr::glimpse()
  • Computing summary statistics of the variables of interest, or finding NAs
    • mosaic::favstats()
    • skimr::skim()
  • Creating informative visualizations.
    • ggplot2::ggplot()
      • geom_histogram() or geom_density() for numeric continuous variables
      • geom_bar() or geom_col() for categorical variables
    • GGally::ggpairs() for scatterplot/correlation matrix
      • Note that you can add transparency to points/density plots in the aes call, for example: aes(colour = gender, alpha = 0.4)

You may wish to have a level 1 header (#) for your EDA, then use level 2 sub-headers (##) to make sure you cover all three EDA bases. At a minimum you should address these questions:

1.1 Raw values, summary statistics, NAs and informative visualizations

  • How many variables/columns? How many rows/observations?

Using the glimpse() function with the dataset, there are 74 variables/columns and 17,703 rows/observations. The variables are split up in the following categories: character (24), date (5), logical (8), numeric (37)

  • Which variables are numbers?

The variables that are numbers can be found via the skim() function under “Variable type: numeric” and for instance are host_listings_count, latitude, accommodates, beds, minimum_nights and availability_60

  • Which are categorical or factor variables (numeric or character variables with variables that have a fixed and known set of possible values?

The variables that are characters are those belonging to “Variable type: character” and are for instance name, listing_url, description and host_name. Those that are factor variables are labelled as “logical”, given they take a certain value based on a logical test. This is the case for bathrooms, host_is_superhost, instant_bookable. For a few of these variables there are numbers missing or NAs so we need to clean up data. To examine the distribution of categorical/factor variables, a bar chart looks like the best option

Below the analysis of raw values and summary statistics:

glimpse(listings) # We can see that a few variables have NA, N/A or values that depend on a logical test
Rows: 17,703
Columns: 74
$ id                                           <dbl> 6400, 23986, 28300, 37256~
$ listing_url                                  <chr> "https://www.airbnb.com/r~
$ scrape_id                                    <dbl> 2.021092e+13, 2.021092e+1~
$ last_scraped                                 <date> 2021-09-20, 2021-09-20, ~
$ name                                         <chr> "The Studio Milan", "\" C~
$ description                                  <chr> "Enjoy your stay at The S~
$ neighborhood_overview                        <chr> "The neighborhood is quie~
$ picture_url                                  <chr> "https://a0.muscache.com/~
$ host_id                                      <dbl> 13822, 95941, 121663, 119~
$ host_url                                     <chr> "https://www.airbnb.com/u~
$ host_name                                    <chr> "Francesca", "Jeremy", "M~
$ host_since                                   <date> 2009-04-17, 2010-03-19, ~
$ host_location                                <chr> "Milan, Lombardia, Italy"~
$ host_about                                   <chr> "I'm am Francesca Sottila~
$ host_response_time                           <chr> "N/A", "N/A", "N/A", "N/A~
$ host_response_rate                           <chr> "N/A", "N/A", "N/A", "N/A~
$ host_acceptance_rate                         <chr> "N/A", "N/A", "N/A", "N/A~
$ host_is_superhost                            <lgl> FALSE, FALSE, FALSE, TRUE~
$ host_thumbnail_url                           <chr> "https://a0.muscache.com/~
$ host_picture_url                             <chr> "https://a0.muscache.com/~
$ host_neighbourhood                           <chr> "Zona 5", "Navigli", "Cen~
$ host_listings_count                          <dbl> 1, 1, 1, 2, 2, 2, 4, 1, 0~
$ host_total_listings_count                    <dbl> 1, 1, 1, 2, 2, 2, 4, 1, 0~
$ host_verifications                           <chr> "['email', 'phone', 'revi~
$ host_has_profile_pic                         <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ host_identity_verified                       <lgl> FALSE, TRUE, TRUE, TRUE, ~
$ neighbourhood                                <chr> "Milan, Lombardy, Italy",~
$ neighbourhood_cleansed                       <chr> "TIBALDI", "NAVIGLI", "SA~
$ neighbourhood_group_cleansed                 <lgl> NA, NA, NA, NA, NA, NA, N~
$ latitude                                     <dbl> 45.44195, 45.44991, 45.47~
$ longitude                                    <dbl> 9.17797, 9.17597, 9.17359~
$ property_type                                <chr> "Private room in rental u~
$ room_type                                    <chr> "Private room", "Entire h~
$ accommodates                                 <dbl> 1, 4, 2, 1, 4, 4, 5, 3, 2~
$ bathrooms                                    <lgl> NA, NA, NA, NA, NA, NA, N~
$ bathrooms_text                               <chr> "3.5 baths", "1 bath", "1~
$ bedrooms                                     <dbl> 3, 1, 1, 1, 2, 2, 2, 2, 1~
$ beds                                         <dbl> 1, 1, 2, 1, 4, 2, 3, 1, 1~
$ amenities                                    <chr> "[\"Hangers\", \"Iron\", ~
$ price                                        <chr> "$100.00", "$150.00", "$1~
$ minimum_nights                               <dbl> 4, 1, 1, 2, 3, 2, 2, 3, 2~
$ maximum_nights                               <dbl> 5, 730, 14, 730, 90, 30, ~
$ minimum_minimum_nights                       <dbl> 4, 1, 1, 2, 3, 2, 2, 3, 2~
$ maximum_minimum_nights                       <dbl> 4, 1, 1, 2, 3, 2, 2, 3, 2~
$ minimum_maximum_nights                       <dbl> 5, 730, 14, 1125, 90, 30,~
$ maximum_maximum_nights                       <dbl> 5, 730, 14, 1125, 90, 30,~
$ minimum_nights_avg_ntm                       <dbl> 4, 1, 1, 2, 3, 2, 2, 3, 2~
$ maximum_nights_avg_ntm                       <dbl> 5, 730, 14, 1125, 90, 30,~
$ calendar_updated                             <lgl> NA, NA, NA, NA, NA, NA, N~
$ has_availability                             <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ availability_30                              <dbl> 23, 28, 30, 0, 0, 23, 0, ~
$ availability_60                              <dbl> 53, 58, 60, 0, 0, 53, 0, ~
$ availability_90                              <dbl> 83, 88, 90, 0, 0, 83, 0, ~
$ availability_365                             <dbl> 358, 363, 365, 0, 203, 35~
$ calendar_last_scraped                        <date> 2021-09-20, 2021-09-20, ~
$ number_of_reviews                            <dbl> 12, 15, 8, 34, 37, 14, 27~
$ number_of_reviews_ltm                        <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0~
$ number_of_reviews_l30d                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ first_review                                 <date> 2014-04-11, 2015-09-21, ~
$ last_review                                  <date> 2010-04-19, 2020-09-07, ~
$ review_scores_rating                         <dbl> 4.89, 4.64, 4.71, 4.90, 4~
$ review_scores_accuracy                       <dbl> 5.00, 4.53, 4.71, 4.79, 4~
$ review_scores_cleanliness                    <dbl> 5.00, 4.40, 4.86, 4.90, 4~
$ review_scores_checkin                        <dbl> 5.00, 4.40, 4.86, 5.00, 5~
$ review_scores_communication                  <dbl> 5.00, 4.53, 4.86, 5.00, 4~
$ review_scores_location                       <dbl> 4.56, 4.53, 5.00, 5.00, 4~
$ review_scores_value                          <dbl> 4.67, 4.53, 5.00, 4.59, 4~
$ license                                      <chr> NA, NA, NA, NA, NA, NA, N~
$ instant_bookable                             <lgl> FALSE, FALSE, FALSE, TRUE~
$ calculated_host_listings_count               <dbl> 1, 1, 1, 2, 2, 1, 1, 1, 1~
$ calculated_host_listings_count_entire_homes  <dbl> 0, 1, 0, 1, 2, 1, 1, 1, 1~
$ calculated_host_listings_count_private_rooms <dbl> 1, 0, 1, 1, 0, 0, 0, 0, 0~
$ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ reviews_per_month                            <dbl> 0.13, 0.21, 0.11, 0.47, 0~
glimpse(listings$price) # When we focus on prices, we can see that they're expressed is US dollars and are not categorized as numeric variable due to the presence of the "$" sign in front of each value
 chr [1:17703] "$100.00" "$150.00" "$180.00" "$55.00" "$75.00" "$180.00" ...
skim(listings) # There are several variables with complete_rate that is not close to 1, meaning that there are many NAs encountered throughout the 17,703 observations
Data summary
Name listings
Number of rows 17703
Number of columns 74
_______________________
Column type frequency:
character 24
Date 5
logical 8
numeric 37
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1.00 33 37 0 17703 0
name 10 1.00 1 255 0 17460 0
description 415 0.98 1 1000 0 16920 0
neighborhood_overview 7400 0.58 1 1000 0 9456 0
picture_url 0 1.00 61 126 0 17502 0
host_url 0 1.00 38 43 0 11915 0
host_name 1 1.00 1 35 0 2861 0
host_location 42 1.00 2 82 0 919 0
host_about 7753 0.56 1 5519 0 5665 16
host_response_time 1 1.00 3 18 0 5 0
host_response_rate 1 1.00 2 4 0 65 0
host_acceptance_rate 1 1.00 2 4 0 91 0
host_thumbnail_url 1 1.00 55 106 0 11847 0
host_picture_url 1 1.00 57 109 0 11847 0
host_neighbourhood 5044 0.72 4 33 0 84 0
host_verifications 0 1.00 2 158 0 291 0
neighbourhood 7400 0.58 12 83 0 44 0
neighbourhood_cleansed 0 1.00 4 28 0 86 0
property_type 0 1.00 4 35 0 52 0
room_type 0 1.00 10 15 0 4 0
bathrooms_text 17 1.00 6 16 0 25 0
amenities 0 1.00 2 1596 0 16087 0
price 0 1.00 5 10 0 513 0
license 16382 0.07 1 110 0 1122 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
last_scraped 0 1.00 2021-09-19 2021-09-20 2021-09-20 2
host_since 1 1.00 2009-02-19 2021-09-14 2015-04-29 3289
calendar_last_scraped 0 1.00 2021-09-19 2021-09-20 2021-09-20 2
first_review 4532 0.74 2011-04-18 2021-09-20 2018-07-27 2593
last_review 4532 0.74 2010-04-19 2021-09-20 2019-10-20 2042

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 1 1 0.17 FAL: 14638, TRU: 3064
host_has_profile_pic 1 1 1.00 TRU: 17627, FAL: 75
host_identity_verified 1 1 0.79 TRU: 13901, FAL: 3801
neighbourhood_group_cleansed 17703 0 NaN :
bathrooms 17703 0 NaN :
calendar_updated 17703 0 NaN :
has_availability 0 1 0.99 TRU: 17503, FAL: 200
instant_bookable 0 1 0.40 FAL: 10657, TRU: 7046

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 2.670292e+07 15763634.89 6.400000e+03 1.232248e+07 2.712945e+07 4.037565e+07 5.234077e+07 ▇▆▆▇▇
scrape_id 0 1.00 2.021092e+13 0.00 2.021092e+13 2.021092e+13 2.021092e+13 2.021092e+13 2.021092e+13 ▁▁▇▁▁
host_id 0 1.00 8.730714e+07 106936802.81 8.168000e+03 1.312530e+07 3.202901e+07 1.351926e+08 4.229571e+08 ▇▂▁▁▁
host_listings_count 1 1.00 1.674000e+01 68.04 0.000000e+00 1.000000e+00 1.000000e+00 4.000000e+00 1.507000e+03 ▇▁▁▁▁
host_total_listings_count 1 1.00 1.674000e+01 68.04 0.000000e+00 1.000000e+00 1.000000e+00 4.000000e+00 1.507000e+03 ▇▁▁▁▁
latitude 0 1.00 4.547000e+01 0.02 4.539000e+01 4.545000e+01 4.547000e+01 4.549000e+01 4.553000e+01 ▁▂▇▇▁
longitude 0 1.00 9.190000e+00 0.03 9.070000e+00 9.170000e+00 9.190000e+00 9.210000e+00 9.280000e+00 ▁▂▇▆▁
accommodates 0 1.00 3.000000e+00 1.49 1.000000e+00 2.000000e+00 3.000000e+00 4.000000e+00 1.600000e+01 ▇▁▁▁▁
bedrooms 1394 0.92 1.250000e+00 0.57 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.600000e+01 ▇▁▁▁▁
beds 172 0.99 1.760000e+00 1.15 0.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 1.600000e+01 ▇▁▁▁▁
minimum_nights 0 1.00 5.510000e+00 27.14 1.000000e+00 1.000000e+00 2.000000e+00 3.000000e+00 1.125000e+03 ▇▁▁▁▁
maximum_nights 0 1.00 3.823783e+04 5010544.58 1.000000e+00 2.900000e+01 3.650000e+02 1.125000e+03 6.666667e+08 ▇▁▁▁▁
minimum_minimum_nights 0 1.00 5.390000e+00 26.93 1.000000e+00 1.000000e+00 2.000000e+00 3.000000e+00 1.125000e+03 ▇▁▁▁▁
maximum_minimum_nights 0 1.00 6.000000e+00 27.20 1.000000e+00 1.000000e+00 2.000000e+00 3.000000e+00 1.125000e+03 ▇▁▁▁▁
minimum_maximum_nights 0 1.00 3.837286e+04 5010543.57 1.000000e+00 3.000000e+01 1.125000e+03 1.125000e+03 6.666667e+08 ▇▁▁▁▁
maximum_maximum_nights 0 1.00 3.839390e+04 5010543.41 1.000000e+00 3.100000e+01 1.125000e+03 1.125000e+03 6.666667e+08 ▇▁▁▁▁
minimum_nights_avg_ntm 0 1.00 5.730000e+00 27.06 1.000000e+00 1.000000e+00 2.000000e+00 3.000000e+00 1.125000e+03 ▇▁▁▁▁
maximum_nights_avg_ntm 0 1.00 3.838416e+04 5010543.48 1.000000e+00 3.100000e+01 1.125000e+03 1.125000e+03 6.666667e+08 ▇▁▁▁▁
availability_30 0 1.00 8.510000e+00 11.02 0.000000e+00 0.000000e+00 0.000000e+00 1.900000e+01 3.000000e+01 ▇▁▁▁▂
availability_60 0 1.00 2.099000e+01 23.57 0.000000e+00 0.000000e+00 7.000000e+00 4.600000e+01 6.000000e+01 ▇▁▁▂▃
availability_90 0 1.00 3.542000e+01 36.58 0.000000e+00 0.000000e+00 1.900000e+01 7.600000e+01 9.000000e+01 ▇▁▂▂▅
availability_365 0 1.00 1.408300e+02 135.32 0.000000e+00 0.000000e+00 9.000000e+01 2.680000e+02 3.650000e+02 ▇▃▂▂▅
number_of_reviews 0 1.00 2.493000e+01 59.78 0.000000e+00 0.000000e+00 4.000000e+00 2.000000e+01 9.140000e+02 ▇▁▁▁▁
number_of_reviews_ltm 0 1.00 2.620000e+00 7.56 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 1.230000e+02 ▇▁▁▁▁
number_of_reviews_l30d 0 1.00 4.900000e-01 1.37 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.900000e+01 ▇▁▁▁▁
review_scores_rating 4532 0.74 4.580000e+00 0.83 0.000000e+00 4.540000e+00 4.810000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_accuracy 4829 0.73 4.760000e+00 0.43 0.000000e+00 4.700000e+00 4.890000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_cleanliness 4828 0.73 4.700000e+00 0.49 0.000000e+00 4.600000e+00 4.860000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_checkin 4830 0.73 4.820000e+00 0.40 0.000000e+00 4.800000e+00 4.960000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_communication 4828 0.73 4.830000e+00 0.40 0.000000e+00 4.820000e+00 4.970000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_location 4830 0.73 4.710000e+00 0.41 0.000000e+00 4.600000e+00 4.830000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_value 4830 0.73 4.610000e+00 0.49 0.000000e+00 4.500000e+00 4.720000e+00 4.900000e+00 5.000000e+00 ▁▁▁▁▇
calculated_host_listings_count 0 1.00 1.305000e+01 44.33 1.000000e+00 1.000000e+00 1.000000e+00 3.000000e+00 3.130000e+02 ▇▁▁▁▁
calculated_host_listings_count_entire_homes 0 1.00 1.232000e+01 44.33 0.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 3.130000e+02 ▇▁▁▁▁
calculated_host_listings_count_private_rooms 0 1.00 6.500000e-01 2.22 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 2.800000e+01 ▇▁▁▁▁
calculated_host_listings_count_shared_rooms 0 1.00 5.000000e-02 0.45 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 8.000000e+00 ▇▁▁▁▁
reviews_per_month 4532 0.74 9.200000e-01 1.38 1.000000e-02 1.000000e-01 3.600000e-01 1.070000e+00 1.500000e+01 ▇▁▁▁▁
skim(listings$price) # There are 513 different unique price fares for the whole Milan neighbourhood
Data summary
Name listings$price
Number of rows 17703
Number of columns 1
_______________________
Column type frequency:
character 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
data 0 1 5 10 0 513 0
favstats(listings$beds) # AirBnBs in Milan have on average of 1.75 beds with more than 50% of accommodations having 1 bed. The observations with bed number equal to zero might be simple studios for daily rent or other types of accommodations not including a bed
minQ1medianQ3maxmeansdnmissing
0112161.761.1517531172
ggplot(data = listings) +
  geom_histogram(mapping = aes(x = beds), binwidth = 1) + 
  labs(title = "Distribution number of beds") +
  ylab("Count") +
  xlab("Beds") +
  scale_y_continuous(breaks = c(1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000)) # The distribution is evidently right-skewed

Below we proceed with data visualizations:

#Example on visualizing distribution and number of occurrences for categorical variables (character, factor). Open graph in new window in order to visualize the right proportions

ggplot(data = listings) +
  geom_bar(mapping = aes(y = neighbourhood_cleansed, fill = neighbourhood_cleansed)) +
  labs(title = "Number of listings per neighbourhood") +
  xlab("Count") +
  ylab("Neighbourhoods") +
  scale_x_continuous(expand = c(0, 0), limits = c(0, 1500), breaks = c(250, 500, 750, 1000, 1250)) +
  theme(legend.position = "none",
        axis.text.x = element_text(size = 3),
        axis.text.y = element_text(size = 3))

listings %>% 
  count(neighbourhood_cleansed)
neighbourhood_cleansedn
ADRIANO44
AFFORI130
BAGGIO46
BANDE NERE309
BARONA41
BICOCCA48
BOVISA188
BOVISASCA9
BRERA693
BRUZZANO41
BUENOS AIRES - VENEZIA1341
CANTALUPA5
CENTRALE622
CHIARAVALLE4
CITTA' STUDI434
COMASINA25
CORSICA202
DE ANGELI - MONTE ROSA295
DERGANO236
DUOMO992
EX OM - MORIVIONE95
FARINI103
FIGINO6
FORZE ARMATE57
GALLARATESE122
GARIBALDI REPUBBLICA227
GHISOLFA225
GIAMBELLINO302
GIARDINI PORTA VENEZIA19
GRATOSOGLIO - TICINELLO31
GRECO126
GUASTALLA377
ISOLA621
LAMBRATE106
LODI - CORVETTO304
LORENTEGGIO60
LORETO733
MACIACHINI - MAGGIOLINA254
MAGENTA - S. VITTORE432
MAGGIORE - MUSOCCO62
MECENATE76
MUGGIANO3
NAVIGLI836
NIGUARDA - CA' GRANDA120
ORTOMERCATO30
PADOVA309
PAGANO186
PARCO AGRICOLO SUD4
PARCO BOSCO IN CITT…5
PARCO DEI NAVIGLI8
PARCO DELLE ABBAZIE9
PARCO FORLANINI - ORTICA46
PARCO LAMBRO - CIMIANO145
PARCO MONLUE' - PONTE LAMBRO21
PARCO NORD5
PARCO SEMPIONE24
PORTA ROMANA366
PORTELLO89
QT 841
QUARTO CAGNINO27
QUARTO OGGIARO58
QUINTO ROMANO13
QUINTOSOLE2
RIPAMONTI90
ROGOREDO35
RONCHETTO DELLE RANE1
RONCHETTO SUL NAVIGLIO100
S. CRISTOFORO253
S. SIRO101
SACCO7
SARPI795
SCALO ROMANA136
SELINUNTE145
STADERA245
TIBALDI266
TICINESE732
TORTONA415
TRE TORRI25
TRENNO12
TRIULZO SUPERIORE14
UMBRIA - MOLISE231
VIALE MONZA216
VIGENTINA256
VILLAPIZZONE330
WASHINGTON373
XXII MARZO535
#Example on visualizing distribution and number of occurrences for continuous variables (numeric, dates)

ggplot(data = listings) +
  geom_histogram(mapping = aes(x = host_since), binwidth = 200) + 
  labs(title = "Distribution of host registration year on AirBnB") +
  xlab("Year of registration") +
  ylab("Count")

  • Which values are the most common? Why?

The distribution of host registration is close to normal, with a fat right tail. Most hosts in the city of Milan registered on AirBnB in 2015, with hosts starting to regiter on the platform from 2010 and diminishing around 2020 due to COVID-19.

  • Which values are rare? Why? Does that match your expectations?

Values around 2010 are rare because AirBnB was not that popular at the time. In fact, AirBnB was founded in 2008. Thus, it matches our expectations, since the company became more popular starting from 2013.

  • Can you see any unusual patterns? What might explain them?

Not spotting any unusual pattern in addition to what has already been stated. No particular clusters to spot

1.2 Scatterplots

  • What are the correlations between variables? Does each scatterplot support a linear relationship between variables? Do any of the correlations appear to be conditional on the value of a categorical variable?

The matrix drawn below gives us an idea of the correlation among variables, filtered by type of room as well. The correlations between variables are generally very different

For instance, in terms of those variables that are positively correlated, people and prices are positively correlated (coefficient of 0.311). The correlation for hotel rooms is higher (0.311), while that for shared rooms is lowest (0.046)

Host year and minimum nights are negatively correlated (coefficient of -0.084), that is, the minimum nights required for a booking by hosts over time have gradually diminished. This statements makes perfect sense in a competitive environement such as AirBnB. There is an exception within this analysis: hotel room types have an opposite trend, that is, minimum nights required have gradually increased with the passing of years

# Parsing price variable and filtering based on rational numbers for AirBnB prices

listings_1 <- listings %>% 
  mutate(price = parse_number(price))

listings_filtered <- listings_1 %>%
  filter(beds < 10 & minimum_nights <= 10 & price <= 250)

# Choosing variables of our interest in order to draw conclusions on relationships between them according to the previously stated filters and 

ggpairs(listings_filtered, 
        mapping = aes(col = room_type), 
        columns = c("host_since", "price", "minimum_nights", "longitude", "accommodates", "beds", "bedrooms"),
        columnLabels = c("Host year", "Price", "Minimum nights","Longitude", "People", "Beds", "Bedrooms")) +
  theme_bw() 

At this stage, you may also find you want to use filter, mutate, arrange, select, or count. Let your questions lead you!

In all cases, please think about the message your plot is conveying. Don’t just say “This is my X-axis, this is my Y-axis”, but rather what’s the so what of the plot. Tell some sort of story and speculate about the differences in the patterns in no more than a paragraph.

1.3 Data wrangling

Once you load the data, it’s always a good idea to use glimpse to see what kind of variables you have and what data type (chr, num, logical, date, etc) they are.

Notice that some of the price data (price) is given as a character string, e.g., “$176.00”

Since price is a quantitative variable, we need to make sure it is stored as numeric data num in the dataframe. To do so, we will first use readr::parse_number() which drops any non-numeric characters before or after the first number.

Use typeof(listing$price) to confirm that price is now stored as a number.

# Investigating the price variable has been already done above: cleaning currency in front of each value -> remember not to run parsing more than once otherwise you should start from scratch. 

#Checking that price observations are now stored as numeric, i.e. "double"

typeof(listings_1$price)
[1] "double"
# Checking on the distribution of prices: right-skewed, so we introduce a new variable to better capture price range in Milan and thus check its behavior

ggplot(data = listings_1) +
  geom_histogram(mapping = aes(x = price), binwidth = 20) +
  labs(title = "Distribution of prices") +
  xlab("Price") +
  ylab("Count")

listings_2 <- listings_1 %>% 
  mutate(logprice = 4*log(listings_1$price))

ggplot(data = listings_2) +
  geom_histogram(mapping = aes(x = logprice), binwidth = 1) +
  labs(title = "Distribution of prices") +
  xlab("Price") +
  ylab("Count")

# Detecting outliers and getting rid of them (negative prices, zero prices and extreme prices). A good range of data based on the below box plot seems between 10 and 25. Let's dig better into specific numbers with using the IQR rule and clean up outliers.

ggplot(data = listings_2) +
  geom_boxplot(mapping = aes(x = logprice)) +
  labs(title = "Detecting outliers")

favstats(listings_2$logprice)
minQ1medianQ3maxmeansdnmissing
8.791617.519.537.617.92.83177030
Q1 <- quantile(listings_2$logprice, .25)
Q3 <- quantile(listings_2$logprice, .75)
IQR <- IQR(listings_2$logprice)

no_outliers <- subset(listings_2, listings_2$logprice > (Q1 - 1.5*IQR) & listings_2$logprice < (Q3 + 1.5*IQR))

ggplot(data = no_outliers) +
  geom_boxplot(mapping = aes(x = logprice)) +
  labs(title = "Cleaned outliers")

1.4 Property types

Next, we look at the variable property_type. We can use the count function to determine how many categories there are their frequency. What are the top 4 most common property types? What proportion of the total listings do they make up?

skim(listings$property_type) # There are 52 unique property types among the observations
Data summary
Name listings$property_type
Number of rows 17703
Number of columns 1
_______________________
Column type frequency:
character 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
data 0 1 4 35 0 52 0
# Properties ordered by frequency: the top 4 most common property types are "Entire rental unit", "Private room in rental unit", "Entire condominium (condo)", "Entire loft". These top 4 properties make up more than 85% of total properties types. 

property_count <- listings %>% 
  count(property_type) %>% 
  arrange(desc(n))

property_count %>% 
  mutate(property_sum = sum(property_count$n),
         property_perc = n / property_sum)
property_typenproperty_sumproperty_perc
Entire rental unit10178177030.575   
Private room in rental unit2661177030.15    
Entire condominium (condo)1830177030.103   
Entire loft833177030.0471  
Private room in condominium (condo)631177030.0356  
Entire residential home274177030.0155  
Entire serviced apartment189177030.0107  
Shared room in rental unit182177030.0103  
Private room in residential home168177030.00949 
Private room in bed and breakfast137177030.00774 
Private room in loft91177030.00514 
Room in boutique hotel56177030.00316 
Room in hotel56177030.00316 
Private room in serviced apartment36177030.00203 
Shared room in condominium (condo)33177030.00186 
Private room in villa30177030.00169 
Tiny house26177030.00147 
Room in aparthotel24177030.00136 
Entire guest suite22177030.00124 
Private room in guest suite21177030.00119 
Private room in townhouse20177030.00113 
Entire villa19177030.00107 
Room in bed and breakfast19177030.00107 
Entire townhouse18177030.00102 
Private room in hostel17177030.00096 
Room in serviced apartment17177030.00096 
Shared room in hostel17177030.00096 
Room in hostel12177030.000678
Private room10177030.000565
Entire place9177030.000508
Shared room in loft9177030.000508
Private room in tiny house8177030.000452
Shared room in residential home8177030.000452
Private room in casa particular5177030.000282
Shared room in bed and breakfast5177030.000282
Private room in guesthouse4177030.000226
Casa particular3177030.000169
Entire bed and breakfast3177030.000169
Entire guesthouse3177030.000169
Entire home/apt3177030.000169
Shared room in tiny house3177030.000169
Camper/RV2177030.000113
Dome house2177030.000113
Boat1177035.65e-05
Cave1177035.65e-05
Earth house1177035.65e-05
Island1177035.65e-05
Private room in camper/rv1177035.65e-05
Private room in cave1177035.65e-05
Private room in earth house1177035.65e-05
Private room in farm stay1177035.65e-05
Tipi1177035.65e-05

Since the vast majority of the observations in the data are one of the top four or five property types, we would like to create a simplified version of property_type variable that has 5 categories: the top four categories and Other. Fill in the code below to create prop_type_simplified.

listings_3 <- listings_2 %>%
  mutate(prop_type_simplified = case_when(
    property_type %in% c("Entire rental unit","Private room in rental unit", "Entire condominium (condo)","Entire loft") ~ property_type, 
    TRUE ~ "Other"
  ))

Use the code below to check that prop_type_simplified was correctly made.

listings_3 %>%
  count(property_type, prop_type_simplified) %>%
  arrange(desc(n))
property_typeprop_type_simplifiedn
Entire rental unitEntire rental unit10178
Private room in rental unitPrivate room in rental unit2661
Entire condominium (condo)Entire condominium (condo)1830
Entire loftEntire loft833
Private room in condominium (condo)Other631
Entire residential homeOther274
Entire serviced apartmentOther189
Shared room in rental unitOther182
Private room in residential homeOther168
Private room in bed and breakfastOther137
Private room in loftOther91
Room in boutique hotelOther56
Room in hotelOther56
Private room in serviced apartmentOther36
Shared room in condominium (condo)Other33
Private room in villaOther30
Tiny houseOther26
Room in aparthotelOther24
Entire guest suiteOther22
Private room in guest suiteOther21
Private room in townhouseOther20
Entire villaOther19
Room in bed and breakfastOther19
Entire townhouseOther18
Private room in hostelOther17
Room in serviced apartmentOther17
Shared room in hostelOther17
Room in hostelOther12
Private roomOther10
Entire placeOther9
Shared room in loftOther9
Private room in tiny houseOther8
Shared room in residential homeOther8
Private room in casa particularOther5
Shared room in bed and breakfastOther5
Private room in guesthouseOther4
Casa particularOther3
Entire bed and breakfastOther3
Entire guesthouseOther3
Entire home/aptOther3
Shared room in tiny houseOther3
Camper/RVOther2
Dome houseOther2
BoatOther1
CaveOther1
Earth houseOther1
IslandOther1
Private room in camper/rvOther1
Private room in caveOther1
Private room in earth houseOther1
Private room in farm stayOther1
TipiOther1

Airbnb is most commonly used for travel purposes, i.e., as an alternative to traditional hotels. We only want to include listings in our regression analysis that are intended for travel purposes:

  • What are the most common values for the variable minimum_nights?

The most common values for the variable are 1, 2 and 3 nights.

  • Is there any value among the common values that stands out?

There are observations which should obviously be deleted from the dataset, such as nights > 365 (that is, greater than a year). It may still be possible that some people booked an AirBnB for 1 year so we would rather keep that information

  • What is the likely intended purpose for Airbnb listings with this seemingly unusual value for minimum_nights?

The likely intended purpose for those AirBnBs with long bookings is long-term rental.

Filter the airbnb data so that it only includes observations with minimum_nights <= 4

min_nights_count <- listings_3 %>% 
  count(minimum_nights) %>% 
  arrange(minimum_nights)

ggplot(data = min_nights_count) + 
  geom_histogram(mapping = aes(x = minimum_nights))

# Filtering bookings for nights below 4 

min_nights_count_1 <- listings_3 %>% 
  count(minimum_nights) %>% 
  filter(minimum_nights <= 4)

2 Mapping

Visualizations of feature distributions and their relations are key to understanding a data set, and they can open up new lines of exploration. While we do not have time to go into all the wonderful geospatial visualizations one can do with R, you can use the following code to start with a map of your city, and overlay all AirBnB coordinates to get an overview of the spatial distribution of AirBnB rentals. For this visualisation we use the leaflet package, which includes a variety of tools for interactive maps, so you can easily zoom in-out, click on a point to get the actual AirBnB listing for that specific point, etc.

The following code, having downloaded a dataframe listings with all AirbnB listings in Milan, will plot on the map all AirBnBs where minimum_nights is less than equal to four (4). You could learn more about leaflet, by following the relevant Datacamp course on mapping with leaflet

leaflet(data = filter(listings, minimum_nights <= 4)) %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   fillColor = "blue", 
                   fillOpacity = 0.4, 
                   popup = ~listing_url,
                   label = ~property_type)

3 Regression Analysis

For the target variable \(Y\), we will use the cost for two people to stay at an Airbnb location for four (4) nights.

Create a new variable called price_4_nights that uses price, and accomodates to calculate the total cost for two people to stay at the Airbnb property for 4 nights. This is the variable \(Y\) we want to explain.

listings_regr <- listings_3 %>%
  filter(minimum_nights == 4 & accommodates == 2) %>% 
  mutate(price_4_nights = 4 * (price * accommodates),
         log_price_4_nights = log(price_4_nights)) # Variable that we want to explain

Use histograms or density plots to examine the distributions of price_4_nights and log(price_4_nights). Which variable should you use for the regression model? Why?

We should use log_price_4_nights given the more even distribution of prices, which in the case of price_4_nights is right-skewed and does not allow us to interpret data well.

skim(listings_regr$price_4_nights)
Data summary
Name listings_regr$price_4_nig…
Number of rows 274
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
data 0 1 906.34 805.04 160 410 640 1200 8000 ▇▁▁▁▁
# First histogram for price_4_nights: distribution difficult to interpret, hence we draw a log scale for the variable

ggplot(data = listings_regr) +
  geom_histogram(mapping = aes(x = price_4_nights))

# Second histogram for log(price_4_nights)

ggplot(data = listings_regr) +
  geom_histogram(mapping = aes(x = log_price_4_nights)) +
  labs(title = "Distribution of prices per four nights")

Fit a regression model called model1 with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.

# Fitting regression model 1

model1 <- lm(log_price_4_nights ~ prop_type_simplified + number_of_reviews + review_scores_rating, data = listings_regr)

summary(model1)

Call:
lm(formula = log_price_4_nights ~ prop_type_simplified + number_of_reviews + 
    review_scores_rating, data = listings_regr)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.1349 -0.4711 -0.1103  0.4382  1.7703 

Coefficients:
                                                 Estimate Std. Error t value
(Intercept)                                      6.456708   0.244629  26.394
prop_type_simplifiedEntire loft                 -0.052152   0.230220  -0.227
prop_type_simplifiedEntire rental unit          -0.015456   0.143840  -0.107
prop_type_simplifiedOther                       -0.074000   0.202877  -0.365
prop_type_simplifiedPrivate room in rental unit -0.323364   0.167271  -1.933
number_of_reviews                               -0.003026   0.001068  -2.834
review_scores_rating                             0.027319   0.043202   0.632
                                                Pr(>|t|)    
(Intercept)                                      < 2e-16 ***
prop_type_simplifiedEntire loft                  0.82104    
prop_type_simplifiedEntire rental unit           0.91455    
prop_type_simplifiedOther                        0.71572    
prop_type_simplifiedPrivate room in rental unit  0.05476 .  
number_of_reviews                                0.00511 ** 
review_scores_rating                             0.52794    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5989 on 183 degrees of freedom
  (因为不存在,84个观察量被删除了)
Multiple R-squared:  0.07208,   Adjusted R-squared:  0.04166 
F-statistic: 2.369 on 6 and 183 DF,  p-value: 0.03149
autoplot(model1)

vif(model1)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.036657  4        1.004510
number_of_reviews    1.025684  1        1.012760
review_scores_rating 1.028962  1        1.014378

Theory section: The higher the t-value, the greater the confidence we have in the coefficient as a predictor. Low t-values are indications of low reliability of the predictive power of that coefficient.

  • Interpret the coefficient review_scores_rating in terms of price_4_nights.

The coefficient “review_scores_rating” has a t-statistic of 0.632, therefore it is not significant in explaining the variability of log_price_4_nights. Hence, we can get rid of it.

  • Interpret the coefficient of prop_type_simplified in terms of price_4_nights.

The type of property is not a good predictor of the variability of prices given the very low t-values, therefore we cannot reject the null hypothesis. This means property type does not have significant explanatory power and we can disregard it as well.

We want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model. Fit a regression model called model2 that includes all of the explanatory variables in model1 plus room_type.

# Fitting regression model 2

model2 <- lm(log_price_4_nights ~ number_of_reviews + room_type, data = listings_regr) # Removed variables that were not statistically significant in model1 and added "room_type"

summary(model2)

Call:
lm(formula = log_price_4_nights ~ number_of_reviews + room_type, 
    data = listings_regr)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.62135 -0.54055 -0.05297  0.50385  2.29068 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            6.696519   0.047197 141.884  < 2e-16 ***
number_of_reviews     -0.004093   0.001090  -3.754 0.000213 ***
room_typePrivate room -0.341134   0.092837  -3.675 0.000287 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6421 on 271 degrees of freedom
Multiple R-squared:  0.08816,   Adjusted R-squared:  0.08143 
F-statistic:  13.1 on 2 and 271 DF,  p-value: 3.707e-06
autoplot(model2)

vif(model2)
number_of_reviews         room_type 
         1.002828          1.002828 

Model2 suggests that the type of rooms is a significant predictor of the variability of prices (t-value -3.675)

3.1 Further variables/questions to explore on our own

Our dataset has many more variables, so here are some ideas on how you can extend your analysis

  1. Are the number of bathrooms, bedrooms, beds, or size of the house (accomodates) significant predictors of price_4_nights? Or might these be co-linear variables?

  2. Do superhosts (host_is_superhost) command a pricing premium, after controlling for other variables?

All the above variables were plugged in model3 in addition to those already existing. There doesn’t seem to be enough explanatory power in the newly added variables, so they can be dropped

# Sourcing the number of bathrooms per property from "bathrooms_text" and analyzing regression model

listings_regr_1 <- listings_regr %>%
  mutate(bathrooms_clean = parse_number(bathrooms_text))

model3 <- lm(log_price_4_nights ~ number_of_reviews + room_type + bathrooms_clean + bedrooms + beds + host_is_superhost, data = listings_regr_1)

summary(model3)

Call:
lm(formula = log_price_4_nights ~ number_of_reviews + room_type + 
    bathrooms_clean + bedrooms + beds + host_is_superhost, data = listings_regr_1)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.65989 -0.51046 -0.03709  0.49041  2.25213 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            6.608930   0.279486  23.647  < 2e-16 ***
number_of_reviews     -0.003904   0.001384  -2.822  0.00519 ** 
room_typePrivate room -0.413553   0.099634  -4.151 4.66e-05 ***
bathrooms_clean        0.049490   0.179039   0.276  0.78247    
bedrooms               0.013302   0.204195   0.065  0.94812    
beds                   0.063346   0.091425   0.693  0.48908    
host_is_superhostTRUE -0.077034   0.148578  -0.518  0.60462    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6467 on 231 degrees of freedom
  (因为不存在,36个观察量被删除了)
Multiple R-squared:  0.107, Adjusted R-squared:  0.08385 
F-statistic: 4.615 on 6 and 231 DF,  p-value: 0.0001894
autoplot(model3)

vif(model3)
number_of_reviews         room_type   bathrooms_clean          bedrooms 
         1.198443          1.065264          1.042983          1.062875 
             beds host_is_superhost 
         1.102542          1.222614 
  1. Some hosts allow you to immediately book their listing (instant_bookable == TRUE), while a non-trivial proportion don’t. After controlling for other variables, is instant_bookable a significant predictor of price_4_nights?

The option to book immediately an AirBnB listing does not have significant predictive power on price

model4 <- lm(log_price_4_nights ~ number_of_reviews + room_type + instant_bookable, data = listings_regr_1)

summary(model4)

Call:
lm(formula = log_price_4_nights ~ number_of_reviews + room_type + 
    instant_bookable, data = listings_regr_1)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.58490 -0.51537 -0.05437  0.48198  2.15328 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            6.660069   0.050740 131.258  < 2e-16 ***
number_of_reviews     -0.004069   0.001085  -3.749 0.000217 ***
room_typePrivate room -0.363828   0.093163  -3.905 0.000119 ***
instant_bookableTRUE   0.173846   0.091524   1.899 0.058570 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.639 on 270 degrees of freedom
Multiple R-squared:  0.1002,    Adjusted R-squared:  0.09018 
F-statistic: 10.02 on 3 and 270 DF,  p-value: 2.778e-06
autoplot(model4)

vif(model4)
number_of_reviews         room_type  instant_bookable 
         1.002963          1.019597          1.017063 
  1. For all cities, there are 3 variables that relate to neighbourhoods: neighbourhood, neighbourhood_cleansed, and neighbourhood_group_cleansed. There are typically more than 20 neighbourhoods in each city, and it wouldn’t make sense to include them all in your model. Use your city knowledge, or ask someone with city knowledge, and see whether you can group neighbourhoods together so the majority of listings falls in fewer (5-6 max) geographical areas. You would thus need to create a new categorical variable neighbourhood_simplified and determine whether location is a predictor of price_4_nights

Location is a reliable predictor of price_4_nights as t-values are big enough for the east, north and south. The estimates of these three variables are approximately -0.5, indicating that the property price in these there areas are significantly cheaper than the baseline (central area). Although the west variable is not significant, it just indicates that the price of the west area is basically the same as that of the baseline (central area). Therefore, we still need to keep the neibourhood variable to help predict the total cost.

listings_regr_2 <- listings_regr_1

listings_regr_2$neighbourhood_simplified <- ifelse(
              listings_regr_2$neighbourhood_cleansed == "ADRIANO" | 
              listings_regr_2$neighbourhood_cleansed == "AFFORI" | 
              listings_regr_2$neighbourhood_cleansed == "BICOCCA" | 
              listings_regr_2$neighbourhood_cleansed == "BOVISA" | 
              listings_regr_2$neighbourhood_cleansed == "BOVISASCA"| 
              listings_regr_2$neighbourhood_cleansed == "BRUZZANO" | 
              listings_regr_2$neighbourhood_cleansed == "COMASINA" | 
              listings_regr_2$neighbourhood_cleansed == "DERGANO" | 
              listings_regr_2$neighbourhood_cleansed == "FARINI" | 
              listings_regr_2$neighbourhood_cleansed == "GHISOLFA" | 
              listings_regr_2$neighbourhood_cleansed == "GRECO" | 
              listings_regr_2$neighbourhood_cleansed == "LORETO" | 
              listings_regr_2$neighbourhood_cleansed == "MACIACHINI - MAGGIOLINA" |
              listings_regr_2$neighbourhood_cleansed == "MAGGIORE - MUSOCCO" | 
              listings_regr_2$neighbourhood_cleansed == "NIGUARDA - CA' GRANDA" |
              listings_regr_2$neighbourhood_cleansed == "PADOVA" | 
              listings_regr_2$neighbourhood_cleansed == "PAGANO" | 
              listings_regr_2$neighbourhood_cleansed == "PARCO LAMBRO - CIMIANO" | 
              listings_regr_2$neighbourhood_cleansed == "PARCO NORD" | 
              listings_regr_2$neighbourhood_cleansed == "QUARTO OGGIARO" | 
              listings_regr_2$neighbourhood_cleansed == "VIALE MONZA" | 
              listings_regr_2$neighbourhood_cleansed == "VILLAPIZZONE", "NORTH",
              ifelse(
                listings_regr_2$neighbourhood_cleansed == "BAGGIO" |
                listings_regr_2$neighbourhood_cleansed == "PORTELLO" |
                listings_regr_2$neighbourhood_cleansed == "QT 8" |
                listings_regr_2$neighbourhood_cleansed == "QUARTO CAGNINO" |
                listings_regr_2$neighbourhood_cleansed == "QUINTO ROMANO" |
                listings_regr_2$neighbourhood_cleansed == "BANDE NERE" |
                listings_regr_2$neighbourhood_cleansed == "FIGINO" |
                listings_regr_2$neighbourhood_cleansed == "GIAMBELLINO" |
                listings_regr_2$neighbourhood_cleansed == "LORENTEGGIO" |
                listings_regr_2$neighbourhood_cleansed == "MUGGIANO" |
                listings_regr_2$neighbourhood_cleansed == "PARCO BOSCO IN CITT\u0085" |
                listings_regr_2$neighbourhood_cleansed == "S. CRISTOFORO" |
                listings_regr_2$neighbourhood_cleansed == "S. SIRO" |
                listings_regr_2$neighbourhood_cleansed == "SACCO" |
                listings_regr_2$neighbourhood_cleansed == "SELINUNTE" |
                listings_regr_2$neighbourhood_cleansed == "TRE TORRI" |
                listings_regr_2$neighbourhood_cleansed == "TRENNO", "WEST",
                ifelse(
                  listings_regr_2$neighbourhood_cleansed == "BARONA" |
                  listings_regr_2$neighbourhood_cleansed == "CANTALUPA" |
                  listings_regr_2$neighbourhood_cleansed == "CHIARAVALLE" |
                  listings_regr_2$neighbourhood_cleansed == "EX OM - MORIVIONE" |
                  listings_regr_2$neighbourhood_cleansed == "GRATOSOGLIO - TICINELLO" |
                  listings_regr_2$neighbourhood_cleansed == "LODI - CORVETTO" |
                  listings_regr_2$neighbourhood_cleansed == "PARCO AGRICOLO SUD" |
                  listings_regr_2$neighbourhood_cleansed == "PARCO DEI NAVIGLI" |
                  listings_regr_2$neighbourhood_cleansed == "PARCO DELLE ABBAZIE" |
                  listings_regr_2$neighbourhood_cleansed == "QUINTOSOLE" |
                  listings_regr_2$neighbourhood_cleansed == "RIPAMONTI" |
                  listings_regr_2$neighbourhood_cleansed == "RONCHETTO SUL NAVIGLIO" |
                  listings_regr_2$neighbourhood_cleansed == "SCALO ROMANA" |
                  listings_regr_2$neighbourhood_cleansed == "STADERA" |  
                  listings_regr_2$neighbourhood_cleansed == "TIBALDI" |
                  listings_regr_2$neighbourhood_cleansed == "TRIULZO SUPERIORE" |
                  listings_regr_2$neighbourhood_cleansed == "VIGENTINA", "SOUTH",
                  ifelse(
                    listings_regr_2$neighbourhood_cleansed == "CITTA' STUDI" |
                    listings_regr_2$neighbourhood_cleansed == "CORSICA" |
                    listings_regr_2$neighbourhood_cleansed == "FORZE ARMATE" |
                    listings_regr_2$neighbourhood_cleansed == "GALLARATESE" |
                    listings_regr_2$neighbourhood_cleansed == "LAMBRATE" |
                    listings_regr_2$neighbourhood_cleansed == "MECENATE" |
                    listings_regr_2$neighbourhood_cleansed == "ORTOMERCATO" |
                    listings_regr_2$neighbourhood_cleansed == "PARCO FORLANINI - ORTICA" |
                    listings_regr_2$neighbourhood_cleansed == "PARCO MONLUE' - PONTE LAMBRO" |
                    listings_regr_2$neighbourhood_cleansed == "ROGOREDO" |
                    listings_regr_2$neighbourhood_cleansed == "UMBRIA - MOLISE", "EAST",
                    ifelse(
                      listings_regr_2$neighbourhood_cleansed == "BRERA" |
                      listings_regr_2$neighbourhood_cleansed == "BUENOS AIRES - VENEZIA" |
                      listings_regr_2$neighbourhood_cleansed == "CENTRALE" |
                      listings_regr_2$neighbourhood_cleansed == "DE ANGELI - MONTE ROSA" |
                      listings_regr_2$neighbourhood_cleansed == "DUOMO" |
                      listings_regr_2$neighbourhood_cleansed == "GARIBALDI REPUBBLICA" |
                      listings_regr_2$neighbourhood_cleansed == "GIARDINI PORTA VENEZIA" |
                      listings_regr_2$neighbourhood_cleansed == "GUASTALLA" |
                      listings_regr_2$neighbourhood_cleansed == "ISOLA" |
                      listings_regr_2$neighbourhood_cleansed == "MAGENTA - S. VITTORE" |
                      listings_regr_2$neighbourhood_cleansed == "NAVIGLI" |
                      listings_regr_2$neighbourhood_cleansed == "PARCO SEMPIONE" |
                      listings_regr_2$neighbourhood_cleansed == "PORTA ROMANA" |
                      listings_regr_2$neighbourhood_cleansed == "SARPI" |
                      listings_regr_2$neighbourhood_cleansed == "TICINESE" |
                      listings_regr_2$neighbourhood_cleansed == "TORTONA" |
                      listings_regr_2$neighbourhood_cleansed == "WASHINGTON" |
                      listings_regr_2$neighbourhood_cleansed == "XXII MARZO", "CENTRAL", "-")))))

skim(listings_regr_2$neighbourhood_simplified)
Data summary
Name listings_regr_2$neighbour…
Number of rows 274
Number of columns 1
_______________________
Column type frequency:
character 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
data 0 1 4 7 0 5 0
model5 <- lm(log_price_4_nights ~ number_of_reviews + room_type + neighbourhood_simplified, data = listings_regr_2)

summary(model5)

Call:
lm(formula = log_price_4_nights ~ number_of_reviews + room_type + 
    neighbourhood_simplified, data = listings_regr_2)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.20515 -0.45724 -0.04588  0.37618  2.11966 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    6.866157   0.053392 128.598  < 2e-16 ***
number_of_reviews             -0.003668   0.001020  -3.596 0.000384 ***
room_typePrivate room         -0.280723   0.087624  -3.204 0.001521 ** 
neighbourhood_simplifiedEAST  -0.497822   0.132211  -3.765 0.000205 ***
neighbourhood_simplifiedNORTH -0.452489   0.094957  -4.765 3.10e-06 ***
neighbourhood_simplifiedSOUTH -0.585835   0.129768  -4.514 9.53e-06 ***
neighbourhood_simplifiedWEST   0.001378   0.140028   0.010 0.992155    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.599 on 267 degrees of freedom
Multiple R-squared:  0.2182,    Adjusted R-squared:  0.2006 
F-statistic: 12.42 on 6 and 267 DF,  p-value: 2.445e-12
autoplot(model5)

vif(model5)
                             GVIF Df GVIF^(1/(2*Df))
number_of_reviews        1.008025  1        1.004004
room_type                1.026607  1        1.013216
neighbourhood_simplified 1.028335  4        1.003499
  1. What is the effect of avalability_30 or reviews_per_month on price_4_nights, after we control for other variables?

Based on model6 below, coefficient “availability_30” is significant in explaining variability of prices, whereas “reviews_per_month” isn’t. For this reason further model6.1 and model 6.2 are computed, and as model 6.2 has a higher adjusted r² with all regressors significant, we think it outperforms model6.1.

model6 <- lm(log_price_4_nights ~ number_of_reviews + room_type + neighbourhood_simplified + availability_30 + reviews_per_month, data = listings_regr_2)

summary(model6)

Call:
lm(formula = log_price_4_nights ~ number_of_reviews + room_type + 
    neighbourhood_simplified + availability_30 + reviews_per_month, 
    data = listings_regr_2)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.95971 -0.33851 -0.06812  0.26488  1.71802 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    6.6285326  0.0648658 102.188  < 2e-16 ***
number_of_reviews             -0.0001663  0.0016725  -0.099 0.920881    
room_typePrivate room         -0.2794309  0.0905326  -3.087 0.002344 ** 
neighbourhood_simplifiedEAST  -0.4050995  0.1218563  -3.324 0.001073 ** 
neighbourhood_simplifiedNORTH -0.4511803  0.0994412  -4.537 1.04e-05 ***
neighbourhood_simplifiedSOUTH -0.4671256  0.1241691  -3.762 0.000227 ***
neighbourhood_simplifiedWEST  -0.3447407  0.1778660  -1.938 0.054154 .  
availability_30                0.0192026  0.0034246   5.607 7.58e-08 ***
reviews_per_month             -0.1198490  0.0964339  -1.243 0.215546    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5067 on 181 degrees of freedom
  (因为不存在,84个观察量被删除了)
Multiple R-squared:  0.3429,    Adjusted R-squared:  0.3138 
F-statistic:  11.8 on 8 and 181 DF,  p-value: 1.784e-13
autoplot(model6)

vif(model6)
                             GVIF Df GVIF^(1/(2*Df))
number_of_reviews        3.514313  1        1.874650
room_type                1.061876  1        1.030473
neighbourhood_simplified 1.117805  4        1.014018
availability_30          1.032053  1        1.015900
reviews_per_month        3.650668  1        1.910672
# In model 6, after adding review_per_month, the number_of_reviews become not significant, indicating that there is a slight collinearity between the two variables. Hence, we need to compare the two variables.

model6.1 <- lm(log_price_4_nights ~ room_type + neighbourhood_simplified + availability_30 + reviews_per_month, data = listings_regr_2) 

summary(model6.1)

Call:
lm(formula = log_price_4_nights ~ room_type + neighbourhood_simplified + 
    availability_30 + reviews_per_month, data = listings_regr_2)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.9593 -0.3345 -0.0696  0.2640  1.7188 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    6.628919   0.064573 102.658  < 2e-16 ***
room_typePrivate room         -0.280332   0.089832  -3.121 0.002099 ** 
neighbourhood_simplifiedEAST  -0.404990   0.121519  -3.333 0.001042 ** 
neighbourhood_simplifiedNORTH -0.450660   0.099033  -4.551 9.75e-06 ***
neighbourhood_simplifiedSOUTH -0.465345   0.122537  -3.798 0.000199 ***
neighbourhood_simplifiedWEST  -0.346988   0.175945  -1.972 0.050109 .  
availability_30                0.019210   0.003415   5.626 6.87e-08 ***
reviews_per_month             -0.127902   0.052235  -2.449 0.015290 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5054 on 182 degrees of freedom
  (因为不存在,84个观察量被删除了)
Multiple R-squared:  0.3428,    Adjusted R-squared:  0.3175 
F-statistic: 13.56 on 7 and 182 DF,  p-value: 4.638e-14
autoplot(model6.1)

vif(model6.1)
                             GVIF Df GVIF^(1/(2*Df))
room_type                1.051233  1        1.025296
neighbourhood_simplified 1.070531  4        1.008556
availability_30          1.031628  1        1.015691
reviews_per_month        1.076982  1        1.037778
model6.2 <- lm(log_price_4_nights ~ number_of_reviews + room_type + neighbourhood_simplified + availability_30, data = listings_regr_2) # Model 6.2 has a higher adjusted r², therefore we keep this one

summary(model6.2)

Call:
lm(formula = log_price_4_nights ~ number_of_reviews + room_type + 
    neighbourhood_simplified + availability_30, data = listings_regr_2)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.03538 -0.35730 -0.07754  0.29912  2.40632 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    6.6738487  0.0549696 121.410  < 2e-16 ***
number_of_reviews             -0.0025690  0.0009399  -2.733 0.006695 ** 
room_typePrivate room         -0.2878995  0.0797799  -3.609 0.000368 ***
neighbourhood_simplifiedEAST  -0.4410529  0.1206043  -3.657 0.000308 ***
neighbourhood_simplifiedNORTH -0.5011355  0.0866931  -5.781 2.08e-08 ***
neighbourhood_simplifiedSOUTH -0.5632983  0.1181807  -4.766 3.09e-06 ***
neighbourhood_simplifiedWEST  -0.0929711  0.1281032  -0.726 0.468629    
availability_30                0.0208242  0.0027794   7.492 1.00e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5453 on 266 degrees of freedom
Multiple R-squared:  0.3544,    Adjusted R-squared:  0.3375 
F-statistic: 20.86 on 7 and 266 DF,  p-value: < 2.2e-16
autoplot(model6.2)

vif(model6.2)
                             GVIF Df GVIF^(1/(2*Df))
number_of_reviews        1.033172  1        1.016451
room_type                1.026755  1        1.013289
neighbourhood_simplified 1.051702  4        1.006321
availability_30          1.048092  1        1.023764

3.2 Diagnostics, collinearity, summary tables

As you keep building your models, it makes sense to:

  1. Check the residuals, using autoplot(model_x).

Done for each model

  1. As you start building models with more explanatory variables, make sure you use `car::vif(model_x)`` to calculate the Variance Inflation Factor (VIF) for your predictors and determine whether you have colinear variables. A general guideline is that a VIF larger than 5 or 10 is large, and your model may suffer from collinearity. Remove the variable in question and run your model again without it.

All models have low VIF

  1. Create a summary table, using huxtable (https://mfa2022.netlify.app/example/modelling_side_by_side_tables/) that shows which models you worked on, which predictors are significant, the adjusted \(R^2\), and the Residual Standard Error.
huxreg(model1, model2, model3, model4, model5, model6.2)
(1)(2)(3)(4)(5)(6)
(Intercept)6.457 ***6.697 ***6.609 ***6.660 ***6.866 ***6.674 ***
(0.245)   (0.047)   (0.279)   (0.051)   (0.053)   (0.055)   
prop_type_simplifiedEntire loft-0.052                                            
(0.230)                                           
prop_type_simplifiedEntire rental unit-0.015                                            
(0.144)                                           
prop_type_simplifiedOther-0.074                                            
(0.203)                                           
prop_type_simplifiedPrivate room in rental unit-0.323                                            
(0.167)                                           
number_of_reviews-0.003 ** -0.004 ***-0.004 ** -0.004 ***-0.004 ***-0.003 ** 
(0.001)   (0.001)   (0.001)   (0.001)   (0.001)   (0.001)   
review_scores_rating0.027                                            
(0.043)                                           
room_typePrivate room        -0.341 ***-0.414 ***-0.364 ***-0.281 ** -0.288 ***
        (0.093)   (0.100)   (0.093)   (0.088)   (0.080)   
bathrooms_clean                0.049                            
                (0.179)                           
bedrooms                0.013                            
                (0.204)                           
beds                0.063                            
                (0.091)                           
host_is_superhostTRUE                -0.077                            
                (0.149)                           
instant_bookableTRUE                        0.174                    
                        (0.092)                   
neighbourhood_simplifiedEAST                                -0.498 ***-0.441 ***
                                (0.132)   (0.121)   
neighbourhood_simplifiedNORTH                                -0.452 ***-0.501 ***
                                (0.095)   (0.087)   
neighbourhood_simplifiedSOUTH                                -0.586 ***-0.563 ***
                                (0.130)   (0.118)   
neighbourhood_simplifiedWEST                                0.001    -0.093    
                                (0.140)   (0.128)   
availability_30                                        0.021 ***
                                        (0.003)   
N190        274        238        274        274        274        
R20.072    0.088    0.107    0.100    0.218    0.354    
logLik-168.613    -265.892    -230.403    -264.073    -244.811    -218.579    
AIC353.226    539.783    476.805    538.146    505.621    455.158    
*** p < 0.001; ** p < 0.01; * p < 0.05.
  1. Finally, you must use the best model you came up with for prediction. Suppose you are planning to visit the city you have been assigned to over reading week, and you want to stay in an Airbnb. Find Airbnb’s in your destination city that are apartments with a private room, have at least 10 reviews, and an average rating of at least 90 assumption: review_scores_rating >= 4.5. Use your best model to predict the total cost to stay at this Airbnb for 4 nights. Include the appropriate 95% interval with your prediction. Report the point prediction and interval in terms of price_4_nights.

The best model according to the above huxtable is model6.2, which is the model with the highest R-squared (0.354) and regressors are not colinear with significant power to explain the Y, i.e. the model that best fits our observations.

After conducting antilog and point prediction, we draw a line plot to compare the prediction and real total costs. As considering our selection criteria, there are only 9 properties left, so we set the x-axis as the index/id of the property (0-8). We can see that the fitted line flctuates around the real line, which is within the range of 95% confidenc einterval.

# We filter the original dataset to get our potential choices
listings_regr_3 <- listings_regr_2 %>% 
  filter(room_type == "Private room" & number_of_reviews >= 10 & review_scores_rating >= 4.5)

# Then we create data frame containing variables the regression model needs

options <- listings_regr_3[c('number_of_reviews', 'room_type', 'neighbourhood_simplified', 'availability_30')]

# As stated before, we would choose model 6.2 as our best model to predict here

log_predictions <- predict(model6.2, options, interval = "confidence", level = 0.95)

# Antilog to get the total costs

predictions <- exp(log_predictions)
predictions
       fit      lwr       upr
1 578.3963 490.8187  681.6006
2 976.8259 741.8942 1286.1520
3 280.0911 213.6477  367.1980
4 323.4941 209.8639  498.6492
5 415.3426 346.1236  498.4043
6 548.0194 463.4663  647.9979
7 267.9445 189.4118  379.0378
8 342.4208 266.4219  440.0990
9 447.4150 357.8257  559.4349
# Plot the point predictions with confidence intervals and real values

ggplot(predictions, aes(y = fit, x = c(0,1,2,3,4,5,6,7,8))) + 
  geom_ribbon(aes(ymin = fit - lwr,
                  ymax = fit + upr,
                  xmin = 0,
                  xmax = 8),
              fill = "lightblue") + 
  scale_x_continuous(expand = c(0, 0), limits = c(0, 8), breaks = c(1,2,3,4,5,6,7,8)) +
  geom_line(data = listings_regr_3, 
            mapping = aes(y = price * 2 * 4, x = c(0,1,2,3,4,5,6,7,8),
                          color = 'black')) +
  geom_line(data = predictions, 
            mapping = aes(y = fit, x = c(0,1,2,3,4,5,6,7,8),
                          color = 'red')) +
  scale_colour_manual(labels = c('real', 'predict'), values = c('red', 'black')) +
  labs(title = 'Predict and Real Total Costs for 2 Person 4 Night',
       x = 'Index',
       y = 'Total Cost') +
  theme_bw() +
  theme(legend.title=element_blank()) +
  NULL

4 Deliverables

  • By midnight on Monday 18 Oct 2021, you must upload on Canvas a short presentation (max 4-5 slides) with your findings, as some groups will be asked to present in class. You should present your Exploratory Data Analysis, as well as your best model. In addition, you must upload on Canvas your final report, written using R Markdown to introduce, frame, and describe your story and findings. You should include the following in the memo:
  1. Executive Summary: Based on your best model, indicate the factors that influence price_4_nights. This should be written for an intelligent but non-technical audience. All other sections can include technical writing.
  2. Data Exploration and Feature Selection: Present key elements of the data, including tables and graphs that help the reader understand the important variables in the dataset. Describe how the data was cleaned and prepared, including feature selection, transformations, interactions, and other approaches you considered.
  3. Model Selection and Validation: Describe the model fitting and validation process used. State the model you selected and why they are preferable to other choices.
  4. Findings and Recommendations: Interpret the results of the selected model and discuss additional steps that might improve the analysis

Remember to follow R Markdown etiquette rules and style; don’t have the Rmd output extraneous messages or warnings, include summary tables in nice tables (use kableExtra), and remove any placeholder texts from past Rmd templates; in other words, (i.e. I don’t want to see stuff I wrote in your final report.)

5 Rubric

Your work will be assessed on a rubric which you can find here

6 Acknowledgements